Abstract: Linear regression is arguably the most prominent among
statistical inference methods, popular both for its simplicity and its
broad applicability. On par with data-intensive applications, the sheer
size of linear regression problems creates an ever-growing demand for
quick and cost-efficient solvers. Fortunately, a significant percentage
of the data accrued can be omitted while maintaining a certain quality
of statistical inference with an affordable computational budget. The
present paper introduces means of identifying and omitting "less
informative" observations in an online and data-adaptive fashion, built
on principles of stochastic approximation and data censoring. First- and
second-order stochastic approximation maximum likelihood-based
algorithms for censored observations are developed for estimating the
regression coefficients. Online algorithms are also put forth to reduce
the overall complexity by adaptively performing censoring along with
estimation. The novel algorithms entail simple closed-form updates, and
have provable (non)asymptotic convergence guarantees. Furthermore,
specific rules are investigated for tuning to desired censoring patterns
and levels of dimensionality reduction. Simulated tests on real and
synthetic datasets corroborate the efficacy of the proposed
data-adaptive methods compared to data-agnostic random projection-based
alternatives.

I. Introduction 

Nowadays, omnipresent monitoring sensors, search engines, rating sites,
and Internet-friendly portable devices generate massive volumes of
typically dynamic data [?]. The task of extracting the most informative,
yet low-dimensional structure from high-dimensional datasets is thus of
utmost importance. Fast-streaming, high-volume data motivate updating
analytics rather than recomputing them from scratch each time a new
observation becomes available. Redundancy is an attribute of massive
datasets encountered in various applications [2], and exploiting it
judiciously offers an effective means of reducing data processing costs.

In this regard, the notion of optimal design of experiments has been
advocated for reducing the number of data required for inference tasks
[?]. In recent works, the importance of sequential optimization along
with random sampling of Big Data has been highlighted [?]. Specifically
for linear regressions, random projection (RP)-based methods have been
advocated for reducing the size of large-scale least-squares (LS)
problems [?], [?], [?]. As far as online alternatives are concerned, the
randomized Kaczmarz (a.k.a. normalized least-mean-squares (LMS))
algorithm generates a sequence of linear regression estimates from
projections onto convex subsets of the data [?], [?], [?]. Sequential
optimization includes stochastic approximation, along with recent
advances on online learning [?]. Frugal solvers of (possibly sparse)
linear regressions are available by estimating regression coefficients
based on (severely) quantized data [?], [?]; see also [?] for
decentralized sparse LS solvers.

In this context, the idea here draws on interval censoring to discard
"less informative" observations. Censoring emerges naturally in several
areas, and batch estimators relying on censored data have been used in
econometrics, biometrics, and engineering tasks [?], including survival
analysis [?], saturated metering [?], and spectrum sensing [?]. It has
recently been employed to select data for distributed estimation of
parameters and dynamical processes using resource-constrained wireless
sensor networks, thus trading off performance for tractability [?], [?],
[?]. These works confirm that estimation accuracy achieved with censored
measurements can be comparable to that based on uncensored data. Hence,
censoring offers the potential to lower data processing costs, a feature
certainly desirable in Big Data applications.

To this end, the present work employs interval censoring for large-scale
online regressions. Its key novelty is to sequentially test and update
regression estimates using censored data. Two censoring strategies are
put forth, each tailored for mitigating different costs. In the first
one, stochastic approximation algorithms are developed for sequentially
updating the regression coefficients with low-complexity first- or
second-order iterations to maximize the likelihood of censored and
uncensored observations. This strategy is ideal when the number of
observations is to be reduced, in order to lower the cost of storage or
transmission to a remote estimation site. Relative to [18], [19], the
contribution here is a novel online scheme that greatly reduces storage
requirements without requiring feedback from the estimator to sensors.
Error bounds are derived, while simulations demonstrate performance
close to estimation error limits.

The second censoring strategy focuses on reducing the complexity of
large-scale linear regressions. The proposed methods are also online by
design, but may also be readily used to reduce the complexity of solving
a batch linear regression problem. The difference with
dimensionality-reducing alternatives, such as optimal design of
experiments, randomized Kaczmarz, and RP-based methods, is that the
introduced technique reduces complexity in a data-driven manner.

The rest of the paper is organized as follows. A formal problem
description is in Section II, while the two censoring rules are
introduced in Section II-A. First- and second-order stochastic
approximation maximum-likelihood-based algorithms for censored
observations are developed in Section III, along with threshold
selection rules for controlled data reduction in Section III-B. Adaptive
censoring algorithms for reduced-complexity linear regressions are in
Section IV, with corresponding threshold selection rules given in
Section IV-C, and robust versions of the algorithms outlined in Section
IV-D. The proposed online-censoring and reduced-complexity methods are
tested on synthetic as well as real data, and compared with competing
alternatives in Section V. Finally, concluding remarks are made in
Section VI.

Notation. Lower- (upper-) case boldface letters denote column vectors
(matrices). Calligraphic symbols are reserved for sets, while the
superscript $^T$ stands for transposition. Vectors $\mathbf{0}$,
$\mathbf{1}$, and $\mathbf{e}_n$ denote the all-zeros, the all-ones, and
the $n$-th canonical vector, respectively. Notation
$\mathcal{N}(\mathbf{m}, \mathbf{C})$ stands for the multivariate
Gaussian distribution with mean $\mathbf{m}$ and covariance matrix
$\mathbf{C}$. The $\ell_1$- and $\ell_2$-norms of a vector
$\mathbf{y} \in \mathbb{R}^d$ are defined as
$\|\mathbf{y}\|_1 := \sum_{i=1}^{d} |y(i)|$ and
$\|\mathbf{y}\|_2 := \sqrt{\sum_{i=1}^{d} |y(i)|^2}$, respectively;
$\phi(t) := (1/\sqrt{2\pi}) \exp(-t^2/2)$ denotes the standardized
Gaussian probability density function (pdf), and
$Q(z) := \int_{z}^{+\infty} \phi(t)\,dt$ the associated complementary
cumulative distribution function. Finally, for a matrix $\mathbf{X}$ let
$\mathrm{tr}(\mathbf{X})$, $\lambda_{\min}(\mathbf{X})$, and
$\lambda_{\max}(\mathbf{X})$ denote the trace, minimum, and maximum
eigenvalue, respectively.
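
The two scalar functions from the Notation paragraph recur in every formula that follows, so a minimal sketch of them may help. It assumes only the Python standard library; the function names `phi` and `Q` are chosen here to mirror the symbols and are not part of the paper.

```python
import math


def phi(t: float) -> float:
    """Standardized Gaussian pdf: exp(-t^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Q(z: float) -> float:
    """Complementary Gaussian cdf, Pr{N(0, 1) > z}, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


print(Q(0.0))  # 0.5, by symmetry of the Gaussian pdf
```

Writing $Q$ through `math.erfc` rather than as `1 - cdf` keeps the tail probability accurate for large positive arguments.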

II. Problem Statement and Preliminaries 

Consider a $p \times 1$ vector of unknown parameters
$\boldsymbol{\theta}_o$ generating scalar streaming observations

$$y_n = \mathbf{x}_n^T \boldsymbol{\theta}_o + v_n, \quad n = 1, 2, \ldots, D \tag{1}$$

where $\mathbf{x}_n^T$ is the $n$-th row of the $D \times p$ regression
matrix $\mathbf{X}$, and the noise samples $v_n$ are assumed
independently drawn from $\mathcal{N}(0, \sigma^2)$. The high-level goal
is to estimate $\boldsymbol{\theta}_o$ in an online fashion, while
meeting minimal resource requirements. The term resources here refers to
the total number of utilized observations and/or regression rows, as
well as the overall computational complexity of the estimation task.
Furthermore, the sought data- and complexity-reduction schemes are
desired to be data-adaptive, and thus scalable to the size of any given
dataset $\{y_n, \mathbf{x}_n\}_{n=1}^{D}$. To meet such requirements,
the proposed first- and second-order online estimation algorithms are
based on the following two distinct censoring methods.


A. NAC and AC Rules 

A generic censoring rule for the data in (1) is given by

$$z_n := \begin{cases} *, & y_n \in \mathcal{C}_n \\ y_n, & \text{otherwise} \end{cases}, \quad n = 1, \ldots, D \tag{2}$$

where $*$ denotes an unknown value when the $n$-th datum has been
censored (thus it is unavailable), a case when we only know that
$y_n \in \mathcal{C}_n$ for some set $\mathcal{C}_n \subseteq \mathbb{R}$;
otherwise, the actual measurement $y_n$ is observed. Given
$\{z_n, \mathbf{x}_n\}_{n=1}^{D}$, the goal is to estimate
$\boldsymbol{\theta}_o$. Aiming to reduce the cost of storage and
possible transmission, it is prudent to rely on innovation-based
interval censoring of $y_n$. To this end, define per time $n$ the binary
censoring variable $c_n = 1$ if $y_n \in \mathcal{C}_n$, and zero
otherwise. Each datum is decided to be censored or not using a predictor
$\hat{y}_n$ formed using a preliminary (e.g., LS) estimate of
$\boldsymbol{\theta}_o$ as

$$\hat{\boldsymbol{\theta}}_K = (\mathbf{X}_K^T \mathbf{X}_K)^{-1} \mathbf{X}_K^T \mathbf{y}_K \tag{3}$$

from $K \geq p$ measurements ($K \ll D$) collected in $\mathbf{y}_K$,
and the corresponding $K \times p$ regression matrix $\mathbf{X}_K$.
Given $\hat{y}_n = \mathbf{x}_n^T \hat{\boldsymbol{\theta}}_K$, the
prediction error $\tilde{y}_n := y_n - \hat{y}_n$ quantifies the
importance of datum $n$ in estimating $\boldsymbol{\theta}_o$. The
latter motivates what we term non-adaptive censoring (NAC) strategy:


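$$(z_n, c_n) := \begin{cases} (y_n, 0), & \text{if } \left| \dfrac{y_n - \mathbf{x}_n^T \hat{\boldsymbol{\theta}}_K}{\sigma} \right| \geq \tau_n \\ (*, 1), & \text{otherwise} \end{cases} \tag{4}$$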


where $\{\tau_n\}_{n=1}^{D}$ are censoring thresholds, and as in (2),
$*$ signifies that the exact value of $y_n$ is unavailable. The rule (4)
censors measurements whose absolute normalized innovation is smaller
than $\tau_n$; and it is non-adaptive in the sense that censoring
depends on $\hat{\boldsymbol{\theta}}_K$ that has been derived from a
fixed subset of $K$ measurements. Clearly, the selection of
$\{\tau_n\}_{n=1}^{D}$ affects the proportion of censored data. Given
streaming data $\{z_n, c_n, \mathbf{x}_n\}$, the next section will
consider constructing a sequential estimator of $\boldsymbol{\theta}_o$
from censored measurements.
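
The NAC rule (4) amounts to a single normalized-innovation test per datum, as the following minimal sketch illustrates. The function name `nac_censor`, the use of `None` for the censored value $*$, and the toy numbers are assumptions made for illustration; the preliminary estimate is taken as given rather than computed from (3).

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def nac_censor(y_n, x_n, theta_K, sigma, tau_n):
    """Apply rule (4): return (z_n, c_n), with None standing in for the censored '*'."""
    normalized_innovation = abs(y_n - dot(x_n, theta_K)) / sigma
    if normalized_innovation >= tau_n:
        return y_n, 0      # informative datum: keep the measurement
    return None, 1         # small innovation: only the interval is known


theta_K = [1.0, -0.5]      # hypothetical preliminary estimate
stream = [([1.0, 2.0], 0.05), ([0.5, 1.0], 1.30)]
for x_n, y_n in stream:
    print(nac_censor(y_n, x_n, theta_K, sigma=0.1, tau_n=1.5))
```

Because the test uses the fixed estimate only, it can run at the sensor without any feedback from the estimator.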

The efficiency of NAC in (4) in terms of selecting informative data
depends on the initial estimate $\hat{\boldsymbol{\theta}}_K$. A
data-adaptive alternative is to take into account all censored data
$\{\mathbf{x}_i, z_i\}_{i=1}^{n-1}$ available up to time $n$. Predicting
data through the most recent estimate $\boldsymbol{\theta}_{n-1}$
defines our data-adaptive censoring (AC) rule:


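$$(z_n, c_n) := \begin{cases} (y_n, 0), & \text{if } \left| \dfrac{y_n - \mathbf{x}_n^T \boldsymbol{\theta}_{n-1}}{\sigma} \right| \geq \tau_n \\ (*, 1), & \text{otherwise} \end{cases} \tag{5}$$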


In Section IV, (5) will be combined with first- and second-order
iterations to perform joint estimation and censoring online.
Implementing the AC rule requires feeding back
$\boldsymbol{\theta}_{n-1}$ from the estimator to the censor, a feature
that may be undesirable in distributed estimation setups. Nonetheless,
in centralized linear regression, AC is well motivated for reducing the
problem dimension and computational complexity.


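III. Online Estimation with NAC

Since the noise samples $\{v_n\}_{n=1}^{D}$ in (1) are independent and
(4) applies independently over data, $\{z_n, c_n\}_{n=1}^{D}$ are
independent too. With $\mathbf{z}_D := [z_1, \ldots, z_D]^T$ and
$\mathbf{c}_D := [c_1, \ldots, c_D]^T$, the joint pdf is
$p(\mathbf{z}_D, \mathbf{c}_D; \boldsymbol{\theta}) = \prod_{n=1}^{D} p(z_n, c_n; \boldsymbol{\theta})$
with

$$p(z_n, c_n; \boldsymbol{\theta}) = \left[ \mathcal{N}(z_n; \mathbf{x}_n^T \boldsymbol{\theta}, \sigma^2) \right]^{1 - c_n} \left[ \Pr\{c_n = 1\} \right]^{c_n} \tag{6}$$

since $c_n = 0$ means no censoring, and thus $z_n = y_n$ is Gaussian
distributed; whereas $c_n = 1$ implies
$|y_n - \hat{y}_n| \leq \tau_n \sigma$, that is
$\Pr\{c_n = 1\} = \Pr\{\hat{y}_n - \tau_n \sigma - \mathbf{x}_n^T \boldsymbol{\theta}_o \leq v_n \leq \hat{y}_n + \tau_n \sigma - \mathbf{x}_n^T \boldsymbol{\theta}_o\}$,
and after recalling that $v_n$ is Gaussian

$$\Pr\{c_n = 1\} = Q\left(z_n^l(\boldsymbol{\theta})\right) - Q\left(z_n^u(\boldsymbol{\theta})\right)$$

where $z_n^l(\boldsymbol{\theta}) := -\tau_n - (\mathbf{x}_n^T \boldsymbol{\theta} - \hat{y}_n)/\sigma$
and $z_n^u(\boldsymbol{\theta}) := \tau_n - (\mathbf{x}_n^T \boldsymbol{\theta} - \hat{y}_n)/\sigma$.
Then, the maximum-likelihood estimator (MLE) of $\boldsymbol{\theta}_o$
is

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \mathcal{L}_D(\boldsymbol{\theta}) := \sum_{n=1}^{D} \ell_n(\boldsymbol{\theta}) \tag{7}$$

where functions $\ell_n(\boldsymbol{\theta})$ are given by (cf. (6))

$$\ell_n(\boldsymbol{\theta}) := \frac{1 - c_n}{2\sigma^2} \left( y_n - \mathbf{x}_n^T \boldsymbol{\theta} \right)^2 - c_n \log\left[ Q\left(z_n^l(\boldsymbol{\theta})\right) - Q\left(z_n^u(\boldsymbol{\theta})\right) \right].$$

If the entire dataset $\{z_n, c_n, \mathbf{x}_n\}_{n=1}^{D}$ were
available, the MLE could be obtained via gradient descent or Newton
iterations.

Algorithm 1 Stochastic Approximation (SA)-MLE
  Initialize $\boldsymbol{\theta}_0$ as the LSE $\hat{\boldsymbol{\theta}}_K$ in (3).
  for n = 1 : D do
    Measurement $y_n$ is possibly censored using (4).
    Estimator receives $(z_n, c_n, \mathbf{x}_n)$.
    Parameter $\boldsymbol{\theta}$ is updated via (8) and (9).
  end for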

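Considering Big Data applications where storage resources are scarce, we
resort to a stochastic approximation solution and process censored data
sequentially. In particular, when datum $n$ becomes available, the
unknown parameter is updated as

$$\boldsymbol{\theta}_n := \boldsymbol{\theta}_{n-1} - \mu_n \mathbf{g}_n(\boldsymbol{\theta}_{n-1}) \tag{8}$$

for a step size $\mu_n > 0$, and with
$\mathbf{g}_n(\boldsymbol{\theta}) = \beta_n(\boldsymbol{\theta}) \mathbf{x}_n$
denoting the gradient of $\ell_n(\boldsymbol{\theta})$, where

$$\beta_n(\boldsymbol{\theta}) := -\frac{1 - c_n}{\sigma^2} \left( y_n - \mathbf{x}_n^T \boldsymbol{\theta} \right) - \frac{c_n}{\sigma} \, \frac{\phi\left(z_n^u(\boldsymbol{\theta})\right) - \phi\left(z_n^l(\boldsymbol{\theta})\right)}{Q\left(z_n^u(\boldsymbol{\theta})\right) - Q\left(z_n^l(\boldsymbol{\theta})\right)}. \tag{9}$$

The overall scheme is tabulated as Algorithm 1.

Observe that when the $n$-th datum is not censored ($c_n = 0$), the
second summand in the right-hand side (RHS) of (9) vanishes, and (8)
reduces to an ordinary LMS update. When $c_n = 1$, the first summand
disappears, and the update in (8) exploits the fact that the unavailable
$y_n$ lies in a known interval
($|y_n - \mathbf{x}_n^T \hat{\boldsymbol{\theta}}_K| \leq \tau_n \sigma$),
information that would have been ignored by an ordinary LMS algorithm.

Since the SA-MLE is in fact a Robbins-Monro iteration on the sequence
$\{\mathbf{g}_n(\boldsymbol{\theta})\}_{n=1}^{\infty}$, it inherits
related convergence properties. Specifically, by selecting
$\mu_n = 1/(nM)$ (for an appropriate $M$), the SA-MLE algorithm is
asymptotically efficient and Gaussian [?, pg. 197]. Performance
guarantees also hold with finite samples. Indeed, with $D$ finite, the
regret attained by iterates $\{\boldsymbol{\theta}_n\}$ against a vector
$\boldsymbol{\theta}$ is defined as

$$R(D) := \sum_{n=1}^{D} \left[ \ell_n(\boldsymbol{\theta}_n) - \ell_n(\boldsymbol{\theta}) \right]. \tag{10}$$

Selecting $\mu$ properly, Algorithm 1 can afford bounded regret as
asserted next; see Appendix for the proof.

Proposition 1. Suppose $\|\mathbf{x}_n\|_2 \leq \bar{x}$ and
$|\beta_n(\boldsymbol{\theta})| \leq \bar{\beta}$ for $n = 1, \ldots, D$,
and let $\boldsymbol{\theta}^*$ be the minimizer of (7). By choosing
$\mu = \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_K\|_2 / (\sqrt{2D}\,\bar{\beta}\bar{x})$,
the regret of the SA-MLE satisfies

$$R(D) \leq \sqrt{2D}\, \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_K\|_2\, \bar{x} \bar{\beta}.$$

Proposition 1 assumes bounded $\mathbf{x}_n$'s and noise. Although the
latter is not satisfied by e.g., the Gaussian distribution, appropriate
bounds ensure that (1) holds with high probability.

Algorithm 2 Second-order SA-MLE
  Initialize $\boldsymbol{\theta}_0$ as the LSE $\hat{\boldsymbol{\theta}}_K$ in (3).
  Initialize $\mathbf{C}_0 = \sigma^2 (\mathbf{X}_K^T \mathbf{X}_K)^{-1}$.
  for n = 1 : D do
    Measurement $y_n$ is possibly censored using (4).
    Estimator receives $(z_n, \mathbf{x}_n, c_n)$.
    Compute $\gamma_n(\boldsymbol{\theta}_{n-1})$ from (12).
    Update matrix step size $\mathbf{C}_n$ from (13).
    Update parameter estimate $\boldsymbol{\theta}_n$ as in (11).
  end for

A. Second-Order SA-MLE

If extra complexity can be afforded, one may consider incorporating
second-order information in the SA-MLE update to improve its
performance. In practice, this is possible by replacing scalar with
matrix step sizes $\mathbf{M}_n$. Thus, the first-order stochastic
gradient descent (SGD) update in (8) is modified as follows

$$\boldsymbol{\theta}_n := \boldsymbol{\theta}_{n-1} - \mathbf{M}_n^{-1} \mathbf{g}_n(\boldsymbol{\theta}_{n-1}). \tag{11}$$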


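When solving $\min_{\boldsymbol{\theta}} \mathbb{E}[\ell_n(\boldsymbol{\theta})]$
using a second-order SA iteration, a desirable Newton-like matrix step
size is $\mathbf{M}_n = \mathbb{E}[\nabla^2 \ell_n(\boldsymbol{\theta}_n)]$.
Given that the latter requires knowing the average Hessian that is not
available in practice, it is commonly surrogated by its sample average
$(1/n) \sum_{i=1}^{n} \nabla^2 \ell_i(\boldsymbol{\theta}_{i-1})$ [?].

To this end, note first that
$\nabla^2 \ell_n(\boldsymbol{\theta}) = \gamma_n(\boldsymbol{\theta}) \mathbf{x}_n \mathbf{x}_n^T$,
where

$$\gamma_n(\boldsymbol{\theta}) := \frac{1 - c_n}{\sigma^2} + \frac{c_n}{\sigma^2} \left[ \left( \frac{\phi\left(z_n^u(\boldsymbol{\theta})\right) - \phi\left(z_n^l(\boldsymbol{\theta})\right)}{Q\left(z_n^u(\boldsymbol{\theta})\right) - Q\left(z_n^l(\boldsymbol{\theta})\right)} \right)^2 - \frac{z_n^u(\boldsymbol{\theta}) \phi\left(z_n^u(\boldsymbol{\theta})\right) - z_n^l(\boldsymbol{\theta}) \phi\left(z_n^l(\boldsymbol{\theta})\right)}{Q\left(z_n^u(\boldsymbol{\theta})\right) - Q\left(z_n^l(\boldsymbol{\theta})\right)} \right]. \tag{12}$$

Due to the rank-one update
$\mathbf{M}_n = ((n-1)/n) \mathbf{M}_{n-1} + (1/n) \gamma_n(\boldsymbol{\theta}_{n-1}) \mathbf{x}_n \mathbf{x}_n^T$,
the matrix step size $\mathbf{C}_n := \mathbf{M}_n^{-1}$ can be obtained
efficiently using the matrix inversion lemma as

$$\mathbf{C}_n = \frac{n}{n-1} \left( \mathbf{C}_{n-1} - \frac{\mathbf{C}_{n-1} \mathbf{x}_n \mathbf{x}_n^T \mathbf{C}_{n-1}}{(n-1)\, \gamma_n^{-1}(\boldsymbol{\theta}_{n-1}) + \mathbf{x}_n^T \mathbf{C}_{n-1} \mathbf{x}_n} \right). \tag{13}$$

Similar to its first-order counterpart, the algorithm is initialized by
the preliminary estimate $\boldsymbol{\theta}_0 = \hat{\boldsymbol{\theta}}_K$,
and $\mathbf{C}_0 = \sigma^2 (\mathbf{X}_K^T \mathbf{X}_K)^{-1}$. The
second-order SA-MLE method is summarized as Algorithm 2, while the
numerical tests of Section V confirm its faster convergence at the cost
of $O(p^2)$ complexity per update.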
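
To make the per-datum quantities concrete, the following is a minimal sketch of the cost $\ell_n$ in (7), the gradient scalar $\beta_n$ in (9), the Hessian scalar $\gamma_n$ in (12), and the first-order update (8). It uses plain Python lists and assumes a common threshold `tau`; the function names, the argument order, and the toy numbers are illustrative choices, not the authors' implementation. The second-order recursion (13) is omitted.

```python
import math


def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Q(z):
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cost_beta_gamma(c_n, z_n, x_n, y_hat, theta, sigma, tau):
    """Return (l_n, beta_n, gamma_n) evaluated at theta.

    z_n is the observed y_n when c_n == 0 and is ignored when c_n == 1;
    y_hat is the NAC predictor x_n^T theta_K used by the censor.
    """
    pred = dot(x_n, theta)
    if c_n == 0:
        r = z_n - pred
        return r * r / (2.0 * sigma ** 2), -r / sigma ** 2, 1.0 / sigma ** 2
    shift = (pred - y_hat) / sigma
    z_l, z_u = -tau - shift, tau - shift
    den = Q(z_u) - Q(z_l)                 # negative, because z_u > z_l
    ratio = (phi(z_u) - phi(z_l)) / den
    cost = -math.log(-den)                # -log Pr{c_n = 1}
    beta = -ratio / sigma
    gamma = (ratio ** 2 - (z_u * phi(z_u) - z_l * phi(z_l)) / den) / sigma ** 2
    return cost, beta, gamma


def sa_mle_step(theta, mu_n, beta_n, x_n):
    """First-order update (8): theta - mu_n * beta_n * x_n."""
    return [t - mu_n * beta_n * xi for t, xi in zip(theta, x_n)]


# One censored datum: the estimate is nudged toward the known interval.
theta = [0.8]
_, beta, gamma = cost_beta_gamma(1, None, [1.0], 0.3, theta, sigma=0.5, tau=1.2)
theta = sa_mle_step(theta, mu_n=0.1, beta_n=beta, x_n=[1.0])
print(theta, gamma)
```

When the prediction $\mathbf{x}_n^T \boldsymbol{\theta}$ coincides with $\hat{y}_n$, the interval is symmetric and $\beta_n = 0$, so a censored datum leaves the estimate untouched.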


B. Controlling Data Reduction via NAC 

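To apply the NAC rule of (4) for data reduction at a controllable rate,
a relation between thresholds $\{\tau_n\}$ and the censoring rate must
be derived. Furthermore, prior knowledge of the problem at hand (e.g.,
observations likely to contain outliers) may dictate a specific pattern
of censoring probabilities $\{\pi_n^*\}_{n=1}^{D}$. If $d$ is the number
of uncensored data after NAC is applied on a dataset of size $D \geq d$,
then $(D - d)/D$ is the censoring ratio. Since $\{y_n\}$ are generated
randomly according to (1), it is clear that $d$ is itself a random
variable. The analysis is thus focused on the average censoring ratio

$$\bar{c} := \mathbb{E}\left[ \frac{D - d}{D} \right] = \frac{1}{D} \sum_{n=1}^{D} \mathbb{E}[c_n] = \frac{1}{D} \sum_{n=1}^{D} \pi_n \tag{14}$$

where $\pi_n := \Pr(c_n = 1)$ is the probability of censoring datum $n$,
that as a function of $\tau_n$ is given by [cf. (4)]

$$\pi_n(\tau_n) = \Pr\{ -\tau_n \sigma \leq y_n - \hat{y}_n \leq \tau_n \sigma \} = \Pr\left\{ -\tau_n \leq \frac{\mathbf{x}_n^T (\boldsymbol{\theta}_o - \hat{\boldsymbol{\theta}}_K) + v_n}{\sigma} \leq \tau_n \right\}. \tag{15}$$

By the properties of the LSE,
$\hat{\boldsymbol{\theta}}_K \sim \mathcal{N}(\boldsymbol{\theta}_o, \sigma^2 (\mathbf{X}_K^T \mathbf{X}_K)^{-1})$,
it follows that

$$\frac{\mathbf{x}_n^T (\boldsymbol{\theta}_o - \hat{\boldsymbol{\theta}}_K) + v_n}{\sigma} \sim \mathcal{N}\left(0,\; \mathbf{x}_n^T (\mathbf{X}_K^T \mathbf{X}_K)^{-1} \mathbf{x}_n + 1 \right).$$

Thus, the censoring probabilities in (15) simplify to

$$\pi_n(\tau_n) = 1 - 2 Q\left( \tau_n \left[ \mathbf{x}_n^T (\mathbf{X}_K^T \mathbf{X}_K)^{-1} \mathbf{x}_n + 1 \right]^{-1/2} \right). \tag{16}$$

Solving (16) for $\tau_n$, one arrives for a given
$\pi_n^* = \pi_n(\tau_n^*)$ at

$$\tau_n^* = \left[ \mathbf{x}_n^T (\mathbf{X}_K^T \mathbf{X}_K)^{-1} \mathbf{x}_n + 1 \right]^{1/2} Q^{-1}\left( \frac{1 - \pi_n^*}{2} \right). \tag{17}$$

Hence, for a prescribed $\bar{c}$, one can select a desired censoring
probability pattern $\{\pi_n^*\}_{n=1}^{D}$ to satisfy (14), and
corresponding $\{\tau_n^*\}_{n=1}^{D}$ in accordance with (17).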

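The threshold selection (17) requires knowledge of all
$\{\mathbf{x}_n\}_{n=1}^{D}$. In addition, implementing (17) for all $D$
observations requires $O(Dp^2)$ computations that may not be affordable
for $D \gg p$. To deal with this, the ensuing simple threshold selection
rule is advocated. Supposing that $\{\mathbf{x}_n\}_{n=1}^{D}$ are
generated i.i.d. according to some unknown distribution with known
first- and second-order moments, a relation between a target common
censoring probability $\pi^*$ and a common threshold $\tau$ can be
obtained in closed form. Assume without loss of generality that
$\mathbb{E}[\mathbf{x}_n] = \mathbf{0}$, and let
$\mathbb{E}[\mathbf{x}_n \mathbf{x}_n^T] = \mathbf{R}_x$ and
$\boldsymbol{\zeta}_K := (\boldsymbol{\theta}_o - \hat{\boldsymbol{\theta}}_K)/\sigma \sim \mathcal{N}(\mathbf{0}, (\mathbf{X}_K^T \mathbf{X}_K)^{-1})$.
For sufficiently large $K$, it holds that
$(\mathbf{X}_K^T \mathbf{X}_K)^{-1} \approx \mathbf{R}_x^{-1}/K$, and
thus $\boldsymbol{\zeta}_K \sim \mathcal{N}(\mathbf{0}, \mathbf{R}_x^{-1}/K)$.
Next, using the standardized Gaussian random vector
$\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_p)$, one can write
$\boldsymbol{\zeta}_K = \mathbf{R}_x^{-1/2} \mathbf{u} / \sqrt{K}$. Also,
with an independent zero-mean random vector $\mathbf{u}_n$ with
$\mathbb{E}[\mathbf{u}_n \mathbf{u}_n^T] = \mathbf{I}_p$, it is also
possible to express $\mathbf{x}_n = \mathbf{R}_x^{1/2} \mathbf{u}_n$,
which implies
$\mathbf{x}_n^T \boldsymbol{\zeta}_K = \mathbf{u}_n^T \mathbf{u} / \sqrt{K}$.
By the central limit theorem (CLT), $\mathbf{u}_n^T \mathbf{u}$
converges in distribution to $\mathcal{N}(0, p)$ as the inner dimension
$p$ of the two vectors grows; thus,
$\mathbf{x}_n^T \boldsymbol{\zeta}_K \sim \mathcal{N}(0, p/K)$. Under
this approximation, it holds that

$$\pi(\tau) \approx 1 - 2 Q\left( \frac{\tau}{\sqrt{p/K + 1}} \right). \tag{18}$$

As expected, due to the normalization by $\sigma$ in (4), $\pi$ does not
depend on $\sigma$. Interestingly, it does not depend on $\mathbf{R}_x$
either. Having expressed $\pi$ as a function of $\tau$, the latter can
be tuned to achieve the desirable data reduction. Following the law of
large numbers and given parameters $p$ and $K$, to achieve an average
censoring ratio of $\bar{c} = \pi^* = (D - d)/D$, the threshold can be
set to

$$\tau = Q^{-1}\left( \frac{1 - \pi^*}{2} \right) \sqrt{1 + p/K}. \tag{19}$$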

Figure 1(a) depicts $\pi$ as a function of $\tau$ for $p = 100$ and
$K = 200$. Function (18) is compared with the simulation-based estimate
of $\pi_n$ using 100 Monte Carlo runs, confirming that (18) offers a
reliable approximation of $\pi$, which improves as $p$ grows. However,
for the approximation
$(\mathbf{X}_K^T \mathbf{X}_K)^{-1} \approx \mathbf{R}_x^{-1}/K$ to be
accurate, $K$ should be large too. Figure 1(b) shows the probability of
censoring for varying $K$ with fixed $p = 100$ and $\tau = 1$.
Approximation (18) yields a reliable value for $\pi$ for as few as
$K \approx 200$ preliminary data.



Fig. 1. (a) Censoring probability for varying threshold $\tau$ ($p = 100$, $K = 200$).
(b) Censoring probability for varying $K$ ($p = 100$, $\tau = 1$).
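
The threshold rules (17) and (19) and the approximation (18) reduce to a few lines once $Q^{-1}$ is available. The sketch below obtains $Q^{-1}(a) = -\Phi^{-1}(a)$ from `statistics.NormalDist` in the Python standard library; the function names are illustrative, and the quadratic form $\mathbf{x}_n^T (\mathbf{X}_K^T \mathbf{X}_K)^{-1} \mathbf{x}_n$ in (17) is assumed to be supplied by the caller.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def Q(z):
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def Q_inv(a):
    """Inverse of Q on (0, 1): Q^{-1}(a) = -Phi^{-1}(a) by symmetry."""
    return -_STD_NORMAL.inv_cdf(a)


def nac_threshold_per_datum(pi_star, quad_form):
    """Rule (17), with quad_form = x_n^T (X_K^T X_K)^{-1} x_n precomputed."""
    return math.sqrt(quad_form + 1.0) * Q_inv((1.0 - pi_star) / 2.0)


def nac_threshold_common(pi_star, p, K):
    """Rule (19): common threshold for a target censoring probability."""
    return Q_inv((1.0 - pi_star) / 2.0) * math.sqrt(1.0 + p / K)


def nac_censoring_probability(tau, p, K):
    """Approximation (18) of the censoring probability for a common threshold."""
    return 1.0 - 2.0 * Q(tau / math.sqrt(1.0 + p / K))


tau = nac_threshold_common(0.9, p=100, K=200)
print(tau, nac_censoring_probability(tau, p=100, K=200))
```

Rule (19) is the exact inverse of (18), so feeding its output back into (18) recovers the target $\pi^*$ up to rounding.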


IV. Big Data Streaming Regression with AC

The NAC-based algorithms of Section III emerge in a wide range of
applications for which censoring occurs naturally as part of the data
acquisition process; see e.g., the Tobit model in economics [?], and
survival data analytics in [?]. Apart from these applications where data
are inherently censored, our idea is to employ censoring deliberately
for data reduction. Leveraging NAC for data reduction decouples
censoring from estimation, and thus eliminates the need for obtaining
further information. However, one intuitively expects improved
performance with a joint censoring-estimation design.

In this context, first- and second-order sequential algorithms will be
developed in this section for the AC in (5). Instead of
$\hat{\boldsymbol{\theta}}_K$, AC is performed using the latest estimate
of $\boldsymbol{\theta}$. Apart from being effective in handling
streaming data, AC can markedly lower the complexity of a batch LS
problem. Section IV-A introduces an AC-based LMS algorithm for
large-scale streaming regressions, while Section IV-B puts forth an
AC-based recursive least-squares (RLS) algorithm as a viable alternative
to random projections and sampling.


A. AC-LMS 

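A first-order AC-based algorithm is presented here, inspired by the
celebrated LMS algorithm. Originally developed for adaptive filtering,
LMS is well motivated for low-complexity online estimation of (possibly
slow-varying) parameters. Given $(y_n, \mathbf{x}_n)$, LMS entails the
simple update

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} + \mu\, \mathbf{x}_n e_n(\boldsymbol{\theta}_{n-1}) \tag{20}$$

where $e_n(\boldsymbol{\theta}) := y_n - \mathbf{x}_n^T \boldsymbol{\theta}$
can be viewed as the innovation of $y_n$, since
$\hat{y}_n = \mathbf{x}_n^T \boldsymbol{\theta}_{n-1}$ is the prediction
of $y_n$ given $\boldsymbol{\theta}_{n-1}$. LMS can be regarded as an
SGD method for $\min_{\boldsymbol{\theta}} \mathbb{E}[f_n(\boldsymbol{\theta})]$,
where the instantaneous cost is
$f_n(\boldsymbol{\theta}) = e_n^2(\boldsymbol{\theta})/2$.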

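To derive a first-order method for online censored regression, consider
minimizing $\mathbb{E}[f_n^{(\tau)}(\boldsymbol{\theta})]$ with the
instantaneous cost selected as the truncated quadratic function

$$f_n^{(\tau)}(\boldsymbol{\theta}) := \begin{cases} \dfrac{e_n^2(\boldsymbol{\theta}) - \tau_n^2 \sigma^2}{2}, & |e_n(\boldsymbol{\theta})| \geq \tau_n \sigma \\ 0, & |e_n(\boldsymbol{\theta})| < \tau_n \sigma \end{cases} \tag{21}$$

for a given $\tau_n > 0$. For the sake of analysis, a common threshold
will be adopted; that is, $\tau_n = \tau$ for all $n$. The truncated
cost can be also expressed as
$f_n^{(\tau)}(\boldsymbol{\theta}) = \max\{0, (e_n^2(\boldsymbol{\theta}) - \tau^2 \sigma^2)/2\}$.
Being the pointwise maximum of two convex functions,
$f_n^{(\tau)}(\boldsymbol{\theta})$ is convex, yet not everywhere
differentiable. From standard rules of subdifferential calculus, its
subgradient is

$$\partial f_n^{(\tau)}(\boldsymbol{\theta}) = \begin{cases} -\mathbf{x}_n e_n(\boldsymbol{\theta}), & |e_n(\boldsymbol{\theta})| > \tau \sigma \\ \mathbf{0}, & |e_n(\boldsymbol{\theta})| < \tau \sigma \\ \{ -\varphi\, \mathbf{x}_n e_n(\boldsymbol{\theta}) : 0 \leq \varphi \leq 1 \}, & |e_n(\boldsymbol{\theta})| = \tau \sigma. \end{cases}$$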

An SGD iteration for the instantaneous cost in (21) with
$\tau_n = \tau$ performs the following AC-LMS update per datum $n$

$$\boldsymbol{\theta}_n := \begin{cases} \boldsymbol{\theta}_{n-1} + \mu\, \mathbf{x}_n e_n(\boldsymbol{\theta}_{n-1}), & |e_n(\boldsymbol{\theta}_{n-1})| \geq \tau \sigma \\ \boldsymbol{\theta}_{n-1}, & \text{otherwise} \end{cases} \tag{22}$$

where $\mu > 0$ can be either constant for tracking a time-varying
parameter, or diminishing over time for estimating a time-invariant
$\boldsymbol{\theta}_o$. Different from SA-MLE, the AC-LMS does not
update $\boldsymbol{\theta}$ if datum $n$ is censored. The intuition is
that if $y_n$ can be closely predicted by
$\hat{y}_n := \mathbf{x}_n^T \boldsymbol{\theta}_{n-1}$, then
$(y_n, \mathbf{x}_n)$ can be censored (a small innovation is indeed not
very informative). Extracting interval information through a likelihood
function as in Algorithm 1 appears to be challenging here. This is
because, unlike NAC, the AC data $\{z_n\}_{n=1}^{D}$ are dependent
across time.

Interestingly, upon invoking the "independent-data assumption" of SA
[?], following the same steps as in Section III, and substituting
$\hat{\boldsymbol{\theta}}_K = \boldsymbol{\theta}_{n-1}$ into (9), the
interval information term is eliminated. This is a strong indication
that interval information from censored observations may be completely
ignored without the risk of introducing bias. Indeed, one of the
implications of the ensuing Proposition 2 is that the AC-LMS is
asymptotically unbiased. Essentially, in AC-LMS as well as in the AC-RLS
to be introduced later, both $\mathbf{x}_n$ and $y_n$ are censored, an
important feature effecting further data reduction and lowering the
computational complexity of the proposed AC algorithms. The mean-square
error (MSE) performance of AC-LMS is established in the next
proposition, proved in the Appendix.


Proposition 2. Assume x„ ’i are generated i.i.d. with E [x, 
0, E [x„xj] = Rj,, E [x^x„x^] = and E (x„x^)^ 


R^, while observations yn are obtained according to model 
(1). For a diminishing step size $\mu_n = \mu/n$ with $\mu = 2/\alpha$, initial
estimate $\boldsymbol{\theta}_1$, and censoring-controlling threshold $\tau$, the
AC-LMS in (22) yields an estimate $\boldsymbol{\theta}_n$ with MSE bounded as


E[||0„-0o||^] < 




I6»i -9o\\i + ^ 


A 

l2 


8 A log n 


where a := 2(3(r)Amin(R-a;). A 2tr(R2:)cr^(l — Q(t) 
-\-tp{t)), and := Amax (R^)- Further, for y < a/(16L^), 
AC-LMS converges exponentially to a bounded error 


E [||6>„ - 9o\\l] < 2 exp (- - 4LV") n - 4LV") 

Proposition 2 asserts that AC-LMS achieves a bounded MSE. It also links
the MSE with the AC threshold $\tau$ that can be used to adjust the
censoring probability. Closer inspection reveals that the MSE bound
shrinks as $\tau$ decreases. In line with intuition, lowering $\tau$
allows the estimator to access more data, thus enhancing estimation
performance at the price of increasing the data volume processed.
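
The AC-LMS recursion (22) is short enough to sketch in full. The example below uses a constant step size, which the text allows for tracking, and synthetic data drawn from model (1); the function name `ac_lms`, the seed, and all numerical values are illustrative assumptions rather than the paper's experimental setup.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def ac_lms(stream, p, mu, sigma, tau):
    """Run AC-LMS (22) over (x_n, y_n) pairs; return the estimate and the update count."""
    theta = [0.0] * p
    updates = 0
    for x_n, y_n in stream:
        e_n = y_n - dot(x_n, theta)          # innovation against the latest estimate
        if abs(e_n) >= tau * sigma:          # uncensored: ordinary LMS step
            theta = [t + mu * e_n * xi for t, xi in zip(theta, x_n)]
            updates += 1
        # censored: neither y_n nor x_n is used, and theta is kept
    return theta, updates


rng = random.Random(0)
theta_o = [1.0, -2.0, 0.5, 3.0, -1.0]        # hypothetical ground truth
sigma, D = 0.1, 5000
data = []
for _ in range(D):
    x = [rng.gauss(0.0, 1.0) for _ in theta_o]
    data.append((x, dot(x, theta_o) + rng.gauss(0.0, sigma)))

theta_hat, used = ac_lms(data, p=len(theta_o), mu=0.05, sigma=sigma, tau=1.5)
err = sum((a - b) ** 2 for a, b in zip(theta_hat, theta_o))
print(f"updates used: {used} of {D}, squared error: {err:.5f}")
```

Each datum costs one inner product for the innovation test; the vector update is paid only for uncensored data.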


B. AC-RLS 


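A second-order AC algorithm is introduced here for the purpose of
sequential estimation and dimensionality reduction. It is closely
related to the RLS algorithm, which per time $n$ implements the updates;
see e.g., [23]

$$\mathbf{C}_n = \frac{n}{n-1} \left[ \mathbf{C}_{n-1} - \frac{\mathbf{C}_{n-1} \mathbf{x}_n \mathbf{x}_n^T \mathbf{C}_{n-1}}{n - 1 + \mathbf{x}_n^T \mathbf{C}_{n-1} \mathbf{x}_n} \right] \tag{23a}$$

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} + \frac{1}{n} \mathbf{C}_n \mathbf{x}_n \left( y_n - \mathbf{x}_n^T \boldsymbol{\theta}_{n-1} \right) \tag{23b}$$

where $\mathbf{C}_n$ is the sample estimate for $\mathbf{R}_x^{-1}$ and
is typically initialized to $\mathbf{C}_0 = \epsilon \mathbf{I}$, for
some small positive $\epsilon$, e.g., [24]. The RLS estimate at time $n$
can be also obtained as

$$\boldsymbol{\theta}_n = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^T \boldsymbol{\theta} \right)^2 + \epsilon \|\boldsymbol{\theta}\|_2^2. \tag{24}$$

The bias introduced by the arbitrary choice of $\mathbf{C}_0$ vanishes
asymptotically in $n$, while the RLS iterates converge to the batch LSE.
RLS can be viewed as a second-order SGD method of the form
$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} - \mathbf{M}_n^{-1} \nabla f_n(\boldsymbol{\theta}_{n-1})$
for the quadratic cost
$f_n(\boldsymbol{\theta}) = e_n^2(\boldsymbol{\theta})/2$. In this
instance of SGD, the ideal matrix step size
$\mathbf{M}_n = \mathbb{E}[\nabla^2 f_n(\boldsymbol{\theta}_{n-1})] = \mathbb{E}[(1 - c_n) \mathbf{x}_n \mathbf{x}_n^T]$
is replaced by its running estimate $(1/n) \mathbf{C}_n^{-1}$; see e.g.,
[22].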


















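Algorithm 3 Adaptive-Censoring (AC)-RLS
  Initialize $\boldsymbol{\theta}_0 = \mathbf{0}$ and $\mathbf{C}_0 = \epsilon \mathbf{I}$.
  for n = 1 : D do
    if $|y_n - \mathbf{x}_n^T \boldsymbol{\theta}_{n-1}| \geq \tau \sigma$ then
      Estimator receives $(y_n, \mathbf{x}_n)$, while $c_n = 0$.
      Update inverse sample covariance from (25a).
      Update estimate from (25b).
    else
      Estimator receives no information ($c_n = 1$).
      Propagate inverse covariance as $\mathbf{C}_n = \frac{n}{n-1} \mathbf{C}_{n-1}$.
      Preserve estimate $\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1}$.
    end if
  end for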


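To obtain a second-order counterpart of AC-LMS, we replace the quadratic
instantaneous cost of RLS with the truncated quadratic in (21). The
matrix step size is further surrogated by

$$\mathbf{M}_n = \frac{1}{n} \sum_{i=1}^{n} (1 - c_i) \mathbf{x}_i \mathbf{x}_i^T = \frac{n-1}{n} \mathbf{M}_{n-1} + \frac{1}{n} (1 - c_n) \mathbf{x}_n \mathbf{x}_n^T.$$

Applying the matrix inversion lemma to find
$\mathbf{C}_n = \mathbf{M}_n^{-1}$ yields the next AC-RLS updates

$$\mathbf{C}_n = \frac{n}{n-1} \left[ \mathbf{C}_{n-1} - \frac{(1 - c_n)\, \mathbf{C}_{n-1} \mathbf{x}_n \mathbf{x}_n^T \mathbf{C}_{n-1}}{n - 1 + \mathbf{x}_n^T \mathbf{C}_{n-1} \mathbf{x}_n} \right] \tag{25a}$$

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} + \frac{1 - c_n}{n} \mathbf{C}_n \mathbf{x}_n \left( y_n - \mathbf{x}_n^T \boldsymbol{\theta}_{n-1} \right) \tag{25b}$$

where $c_n$ is decided by (5). For $c_n = 1$, the parameter vector is
not updated, while costly updates of $\mathbf{C}_n$ are also avoided. In
addition, different from the iterative expectation-maximization
algorithm in [?], AC-RLS skips covariance updates completely. Its
performance is characterized by the following proposition, shown in the
Appendix.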


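Proposition 3. If $\mathbf{x}_n$'s are i.i.d. with
$\mathbb{E}[\mathbf{x}_n] = \mathbf{0}$ and
$\mathbb{E}[\mathbf{x}_n \mathbf{x}_n^T] = \mathbf{R}_x$, while
observations $y_n$ adhere to the model in (1), then for
$\boldsymbol{\theta}_1 = \mathbf{0}$ and constant $\tau$, there exists
$k > 0$ such that AC-RLS estimates $\boldsymbol{\theta}_n$ yield bounded
MSE

$$\frac{1}{n} \mathrm{tr}(\mathbf{R}_x^{-1}) \sigma^2 \leq \mathbb{E}\left[ \|\boldsymbol{\theta}_n - \boldsymbol{\theta}_o\|_2^2 \right] \leq \frac{1}{n} \frac{\mathrm{tr}(\mathbf{R}_x^{-1}) \sigma^2}{2 Q(\tau)}, \quad \forall n \geq k.$$

As corroborated by Proposition 3, the AC-RLS estimates are guaranteed to
converge to $\boldsymbol{\theta}_o$ for any choice of $\tau$. Overall,
the novel AC-RLS algorithm offers a computationally efficient and
accurate means of solving large-scale LS problems encountered with Big
Data applications.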

At this point, it is useful to contrast and compare AC-RLS with RP and
random sampling methods that have been advocated as fast LS solvers
[25], [?]. In practice, RP-based schemes first premultiply data
$(\mathbf{y}, \mathbf{X})$ with a random matrix
$\mathbf{R} = \mathbf{H}\mathbf{D}$, where $\mathbf{H}$ is a
$D \times D$ Hadamard matrix and $\mathbf{D}$ is a diagonal matrix whose
diagonal entries take values $\{-1/\sqrt{D}, +1/\sqrt{D}\}$
equiprobably. Intuitively, $\mathbf{R}$ renders all rows of "comparable
importance" (quantified by the leverage scores [25], [?]), so that the
ensuing random selection matrix $\mathbf{S}_d$ exhibits no preference in
selecting uniformly a subset of $d$ rows. Then, the reduced-size LS
problem can be solved as
$\hat{\boldsymbol{\theta}}_d = \arg\min_{\boldsymbol{\theta}} \|\mathbf{S}_d \mathbf{H} \mathbf{D} (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})\|_2^2$.
For a general preconditioning matrix $\mathbf{H}\mathbf{D}$, computing
the products $\mathbf{H}\mathbf{D}\mathbf{y}$ and
$\mathbf{H}\mathbf{D}\mathbf{X}$ requires a prohibitive number of
$O(D^2 p)$ computations. This is mitigated by the fact that $\mathbf{H}$
has binary $\{+1, -1\}$ entries, and thus multiplications can be
implemented as simple sign flips. Overall, the RP method reduces the
computational complexity of the LS problem from $O(Dp^2)$ to $o(Dp^2)$
operations.
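
For comparison with the data-adaptive approach, the preconditioning-and-sampling step of the RP baseline can be sketched as follows. The sketch applies $\mathbf{H}\mathbf{D}$ to one column through a fast Walsh-Hadamard transform, which uses only additions and subtractions, and then keeps $d$ rows chosen uniformly at random. It assumes the number of rows is a power of two; the function names `fwht` and `precondition`, the seed, and the toy sizes are illustrative, and the reduced LS solve itself is not shown.

```python
import math
import random


def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^k list (returns a new list)."""
    v = list(v)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b   # butterflies: sign flips only
        h *= 2
    return v


def precondition(column, signs):
    """Apply H * diag(signs / sqrt(D)) to one column (y, or a column of X)."""
    scale = 1.0 / math.sqrt(len(column))
    return fwht([s * scale * c for s, c in zip(signs, column)])


rng = random.Random(1)
num_rows, d = 8, 3
signs = [rng.choice((-1.0, 1.0)) for _ in range(num_rows)]
y = [float(i) for i in range(num_rows)]

y_mixed = precondition(y, signs)
kept = sorted(rng.sample(range(num_rows), d))   # uniform row selection S_d
y_reduced = [y_mixed[i] for i in kept]
print(kept, y_reduced)
```

The row selection never looks at the observations, which is the sense in which RP is data-agnostic.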

By setting $\tau = Q^{-1}(d/(2D))$, our AC-RLS Algorithm 3 achieves an
average reduction ratio $d/D$ by scanning the observations, and
selecting only the most informative ones. The same data ratio can be
achieved more accurately by choosing a sequence of data-adaptive
thresholds $\{\tau_n\}_{n=1}^{D}$, as described in the next subsection.
As will be seen in Section V-C, AC-RLS achieves significantly lower
estimation error compared to RP-based solvers. Intuitively, this is
because unlike RPs that are based solely on $\mathbf{X}$ and are thus
observation-agnostic, AC extracts the subset of rows that is most
informative in terms of innovation for a given problem instance
$(\mathbf{y}, \mathbf{X})$.

Regarding the complexity of AC-RLS, if the pair $(y_n, \mathbf{x}_n)$ is
not censored, the cost of updating $\boldsymbol{\theta}_n$ and
$\mathbf{C}_n$ is $O(p^2)$ multiplications. For a censored datum, there
is no such cost. Thus, for $d$ uncensored data the overall computational
complexity is $O(dp^2)$. Furthermore, evaluation of the absolute
normalized innovation requires $O(p)$ multiplications per iteration.
Since this operation takes place at each of the $D$ iterations, there
are $O(Dp)$ computations to be accounted for. Overall, AC-RLS reduces
the complexity of LS from $O(Dp^2)$ to $O(dp^2) + O(Dp)$. Evidently, the
complexity reduction is more prominent for larger model dimension $p$.
For $p \gg 1$, the second term may be neglected, yielding an $O(dp^2)$
complexity for AC-RLS.
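
The cost structure just described, an $O(p)$ test for every datum and an $O(p^2)$ update only for uncensored ones, is visible in the following minimal sketch of AC-RLS. As an assumption made here to avoid the division by $n - 1$ at $n = 1$, the sketch propagates the unnormalized matrix $\mathbf{P}_n = \mathbf{C}_n / n$, for which (25a) and (25b) reduce to the standard RLS recursions and a censored step leaves $\mathbf{P}_n$ unchanged. The initialization `delta`, the function name, and the synthetic data are likewise illustrative.

```python
import random
from statistics import NormalDist


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def ac_rls(stream, p, sigma, tau, delta=1e3):
    """AC-RLS sketch with P = C_n / n; returns the estimate and the number of updates."""
    theta = [0.0] * p
    P = [[delta if i == j else 0.0 for j in range(p)] for i in range(p)]
    used = 0
    for x, y in stream:
        e = y - dot(x, theta)                  # O(p) innovation test, every datum
        if abs(e) < tau * sigma:
            continue                           # censored: theta and P are kept
        Px = [dot(row, x) for row in P]        # O(p^2), uncensored data only
        denom = 1.0 + dot(x, Px)
        P = [[P[i][j] - Px[i] * Px[j] / denom for j in range(p)] for i in range(p)]
        gain = [dot(row, x) for row in P]
        theta = [t + g * e for t, g in zip(theta, gain)]
        used += 1
    return theta, used


rng = random.Random(0)
theta_o = [1.0, -2.0, 0.5, 3.0]                # hypothetical ground truth
sigma, D, d = 0.1, 2000, 400
tau = -NormalDist().inv_cdf(d / (2 * D))       # tau = Q^{-1}(d / (2D))
data = []
for _ in range(D):
    x = [rng.gauss(0.0, 1.0) for _ in theta_o]
    data.append((x, dot(x, theta_o) + rng.gauss(0.0, sigma)))

theta_hat, used = ac_rls(data, len(theta_o), sigma, tau)
err = sum((a - b) ** 2 for a, b in zip(theta_hat, theta_o))
print(f"tau = {tau:.4f}, updates used: {used} of {D}, squared error: {err:.6f}")
```

With a fixed threshold the number of updates typically exceeds the target $d$ during the initial transient, which is the under-censoring that data-adaptive thresholds are meant to correct.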

A couple of remarks are now in order. 

Remark 1. The novel AC-LMS and AC-RLS algorithms bear structural
similarities to sequential set-membership (SM)-based estimation [?],
[?]. However, the model assumptions and objectives of the two are
different. SM assumes that the noise distribution in (1) has bounded
support, which implies that $\boldsymbol{\theta}_o$ belongs to a closed
set. This set is sequentially identified by algorithms interpreted
geometrically, while certain observations may be deemed redundant and
thus discarded by the SM estimator. In our Big Data setup, an SA
approach is developed to deliberately skip updates of low importance for
reducing complexity regardless of the noise pdf.

Remark 2. Estimating regression coefficients relying on "most
informative" data is reminiscent of support vector regression (SVR),
which typically adopts an $\epsilon$-insensitive cost (truncated
$\ell_1$ error norm). SVR has well-documented merits in robustness as
well as generalization capability, both of which are attractive for
(even nonlinear kernel-based) prediction tasks [?]. Solvers are
typically based on nonlinear programming, and support vectors (SVs) are
returned after batch processing that does not scale well with the data
size. Inheriting the merits of SVRs, the novel AC-LMS and AC-RLS can be
viewed as returning "causal SVs," which are different from the
traditional (non-causal) batch SVs, but become available on-the-fly at
complexity and storage requirements that are affordable for streaming
Big Data. In fact, we conjecture that causal SVs returned by AC-RLS will
approach their non-causal SVR counterparts if multiple passes over the
data are allowed. Mimicking SVR costs, our AC-based schemes developed
using the truncated $\ell_2$ cost [cf. (21)] can be readily generalized
to their counterparts based on the truncated $\ell_1$ error norm.
Cross-pollinating in the other direction, our AC-RLS iterations can be
useful for online support vector machines capable of learning from
streaming large-scale data with second-order closed-form iterations.


C. Controlling Data Reduction via AC 

A clear distinction between NAC and AC is that the latter
depends on the estimation algorithm used. As a result, threshold
design rules are estimation-driven rather than universal.
In this section, threshold selection strategies are proposed for
AC-RLS. Recall the average reduction ratio $c$ in (14), and let
$\boldsymbol{\zeta}_n := (\boldsymbol{\theta}_o - \boldsymbol{\theta}_n)/\sigma \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_n)$ denote the normalized error
at the $n$-th iteration. Similar to (14)-(15), it holds that

$$\pi_n(\tau_n) = 1 - 2Q\left(\tau_n \left[\mathbf{x}_n^T \mathbf{K}_{n-1} \mathbf{x}_n + 1\right]^{-1/2}\right). \qquad (26)$$

For $n \gg p$, estimates are sufficiently close to $\boldsymbol{\theta}_o$ and
thus $\mathbf{K}_n \approx \mathbf{0}$. Then, the data-agnostic $\tau_n \approx Q^{-1}\left((1-\pi^*)/2\right)$
attains an average censoring probability $\pi^*$, while its asymptotic
properties have been studied in [19]. For finite data, this simple
rule leads to under-censoring by ignoring appreciable values
of $\mathbf{K}_n$, which can increase computational complexity considerably.
This consideration motivates the data-adaptive
threshold selection rules designed next.
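
The relation between threshold and censoring probability in (26) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the scalar argument `quad_form` (standing for $\mathbf{x}_n^T \mathbf{K}_{n-1} \mathbf{x}_n$) are illustrative choices and not part of the original algorithms.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero-mean, unit-variance Gaussian


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = Pr{Z > x} for Z ~ N(0, 1)."""
    return 1.0 - _STD_NORMAL.cdf(x)


def q_inverse(prob: float) -> float:
    """Inverse of Q on (0, 1): returns x such that Q(x) = prob."""
    return _STD_NORMAL.inv_cdf(1.0 - prob)


def censoring_probability(tau: float, quad_form: float) -> float:
    """Censoring probability in the form of (26).

    quad_form stands for x_n^T K_{n-1} x_n, a non-negative scalar.
    """
    return 1.0 - 2.0 * q_function(tau / sqrt(quad_form + 1.0))


def agnostic_threshold(pi_star: float) -> float:
    """Data-agnostic threshold obtained from (26) when K_n is negligible."""
    return q_inverse((1.0 - pi_star) / 2.0)


tau = agnostic_threshold(0.9)
# With quad_form = 0 the prescribed probability is met; with quad_form > 0
# the same threshold censors less often (the under-censoring effect).
print(censoring_probability(tau, 0.0), censoring_probability(tau, 0.5))
```

The second printed value being smaller than the first illustrates why ignoring an appreciable $\mathbf{K}_n$ leads to under-censoring for finite data.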

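AC-RLS updates can be seen as ordinary RLS updates
on the subsequence of uncensored data. After ignoring the
transient error due to initialization, it holds that
$\mathbf{K}_n \approx \left[\sum_{i=1}^{n} (1 - c_i)\mathbf{x}_i\mathbf{x}_i^T\right]^{-1}$. The term $\mathbf{x}_n^T\mathbf{K}_{n-1}\mathbf{x}_n$ is encountered
as $\mathbf{x}_n^T\mathbf{C}_{n-1}\mathbf{x}_n/n$ in the updates of Alg. 3, but it is not computed
for censored measurements. Nonetheless, $\mathbf{x}_n^T\mathbf{C}_{n-1}\mathbf{x}_n/n$
can be obtained at the cost of $p(p+1)$ multiplications per
censored datum. Then, the exact censoring probability at AC-RLS
iteration $n$ can be tuned to a prescribed $\pi_n^*$ by selecting

$$\tau_n = \left(\mathbf{x}_n^T\mathbf{C}_{n-1}\mathbf{x}_n/n + 1\right)^{1/2} Q^{-1}\left(\frac{1-\pi_n^*}{2}\right). \qquad (27)$$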


Given $\{\pi_n^*\}_{n=1}^{D}$ satisfying (14), an average censoring ratio of
$(D - d)/D$ is thus achieved in a controlled fashion.
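
As a sanity check on rule (27), the sketch below computes the online threshold for one regressor and plugs it back into the censoring probability (26), with $\mathbf{x}_n^T\mathbf{K}_{n-1}\mathbf{x}_n$ replaced by $\mathbf{x}_n^T\mathbf{C}_{n-1}\mathbf{x}_n/n$. The helper names and the numerical values of `x_n` and `C_prev` are hypothetical and serve illustration only.

```python
from math import sqrt
from statistics import NormalDist


def quadratic_form(x, C):
    """Return x^T C x for a vector x and a square matrix C (nested lists)."""
    size = len(x)
    return sum(x[i] * C[i][j] * x[j] for i in range(size) for j in range(size))


def online_threshold(x, C_prev, n, pi_star):
    """Threshold in the form of (27) for the n-th datum."""
    q_inv = NormalDist().inv_cdf(1.0 - (1.0 - pi_star) / 2.0)  # Q^{-1}((1 - pi*)/2)
    return sqrt(quadratic_form(x, C_prev) / n + 1.0) * q_inv


def censoring_probability(tau, x, C_prev, n):
    """Censoring probability (26), using x^T C_{n-1} x / n for x^T K_{n-1} x."""
    scaled = tau / sqrt(quadratic_form(x, C_prev) / n + 1.0)
    return 1.0 - 2.0 * (1.0 - NormalDist().cdf(scaled))


# Hypothetical values, for illustration only.
x_n = [1.0, -2.0, 0.5]
C_prev = [[2.0, 0.3, 0.0], [0.3, 1.5, 0.1], [0.0, 0.1, 1.0]]
tau_n = online_threshold(x_n, C_prev, n=20, pi_star=0.8)
print(tau_n, censoring_probability(tau_n, x_n, C_prev, n=20))
```

The quadratic form is the only per-datum cost added by this rule, which is consistent with the $p(p+1)$ multiplications mentioned above.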

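Although lower than that of ordinary RLS, the complexity
of AC-RLS using the threshold selection rule (27) is
still $\mathcal{O}(Dp^2)$. To further lower complexity, a simpler rule
is proposed that relies on averaging out the contribution of
individual rows $\mathbf{x}_n^T$ in the censoring process. Suppose that
$\mathbf{x}_n$'s are generated i.i.d. with $E[\mathbf{x}_n] = \mathbf{0}$ and $E[\mathbf{x}_n\mathbf{x}_n^T] = \mathbf{R}_x$.
Similar to Section III-B, for $p$ sufficiently large the inner
product $\mathbf{x}_n^T\boldsymbol{\zeta}_{n-1}$ is approximately Gaussian. It then follows that
the a-priori error $e_n(\boldsymbol{\theta}_{n-1}) = \mathbf{x}_n^T(\boldsymbol{\theta}_o - \boldsymbol{\theta}_{n-1}) + v_n$ is zero-mean
Gaussian with variance

$$\sigma_{e_n}^2 = \sigma^2 E\left[\mathbf{x}_n^T\boldsymbol{\zeta}_{n-1}\boldsymbol{\zeta}_{n-1}^T\mathbf{x}_n\right] + \sigma^2 = \sigma^2\,\mathrm{tr}\left(E\left[\mathbf{x}_n\mathbf{x}_n^T\boldsymbol{\zeta}_{n-1}\boldsymbol{\zeta}_{n-1}^T\right]\right) + \sigma^2 = \sigma^2\,\mathrm{tr}\left(\mathbf{R}_x\mathbf{K}_{n-1}\right) + \sigma^2,$$

where the first equality follows from the independence of
$\mathbf{x}_n^T\boldsymbol{\zeta}_{n-1}$ and $v_n$, and the third one from that of $\mathbf{x}_n$ with $\boldsymbol{\zeta}_{n-1}$.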

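The censoring probability at time $n$ is then expressed as

$$\pi_n = \Pr\{|e_n(\boldsymbol{\theta}_{n-1})| \le \tau_n\sigma\} = 1 - 2Q\left(\frac{\tau_n\sigma}{\sigma_{e_n}}\right).$$

To attain $\pi_n^*$, the threshold per datum $n$ is selected as

$$\tau_n = \frac{\sigma_{e_n}}{\sigma}\, Q^{-1}\left(\frac{1-\pi_n^*}{2}\right). \qquad (28)$$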


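It is well known that for large $n$, the RLS error covariance
matrix $\mathbf{K}_n$ converges to $\frac{1}{n}\mathbf{R}_x^{-1}$. Specifying $\{\pi_i^*\}_{i=1}^{n}$ is
equivalent to selecting an average number of $\sum_{i=1}^{n}(1-\pi_i^*)$
RLS iterations until time $n$. Thus, the AC-RLS with controlled
selection probabilities yields an error covariance matrix
$\mathbf{K}_n \approx \left(\sum_{i=1}^{n}(1-\pi_i^*)\right)^{-1}\mathbf{R}_x^{-1}$. Combined with the variance expression for $\sigma_{e_n}^2$, the
latter leads to

$$\sigma_{e_n}^2 \approx \sigma^2\left(\frac{p}{\sum_{i=1}^{n-1}(1-\pi_i^*)} + 1\right).$$

Plugging $\sigma_{e_n}$ into (28) yields the simple threshold selection

$$\tau_n = \left(\frac{p}{\sum_{i=1}^{n-1}(1-\pi_i^*)} + 1\right)^{1/2} Q^{-1}\left(\frac{1-\pi_n^*}{2}\right). \qquad (29)$$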


Unlike (27), where thresholds are decided online at an additional
computational cost, (29) offers an off-line threshold
design strategy for AC-RLS. Based on (29), to achieve $c = \pi^* = (D - d)/D$, thresholds are chosen as

$$\tau_n = \left(\frac{p}{(n-1)(1-\pi^*)} + 1\right)^{1/2} Q^{-1}\left(\frac{1-\pi^*}{2}\right), \qquad (30)$$

which attains a constant $\pi^*$ across iterations.
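
Because (30) depends only on $p$, $n$, and $\pi^*$, the whole threshold schedule can be tabulated before any data arrive. A minimal sketch follows, assuming the schedule starts at $n = 2$ since (30) divides by $n - 1$; how the first datum is treated is left open here, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def offline_thresholds(p, D, pi_star):
    """Thresholds tau_2, ..., tau_D in the form of (30).

    Rule (30) divides by (n - 1), so this sketch starts at n = 2
    (an assumption of the sketch, not a statement about the original design).
    """
    q_inv = NormalDist().inv_cdf(1.0 - (1.0 - pi_star) / 2.0)  # Q^{-1}((1 - pi*)/2)
    return [sqrt(p / ((n - 1) * (1.0 - pi_star)) + 1.0) * q_inv
            for n in range(2, D + 1)]


taus = offline_thresholds(p=30, D=5000, pi_star=0.75)
# Early thresholds are large and decay toward the data-agnostic value.
print(taus[0], taus[-1])
```

The schedule decreases monotonically in $n$ and approaches $Q^{-1}((1-\pi^*)/2)$, which matches the data-agnostic rule discussed after (26).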


D. Robust AC-LMS and AC-RLS 

AC-LMS and AC-RLS were designed to adaptively select
data with relatively large innovation. This is reasonable provided
that (1) contains no outliers whose extreme values may
give rise to large innovations too, and thus be mistaken for
informative data. Our idea to gain robustness against outliers
is to adopt the modified AC rule

$$(c_n, c_n^o) = \begin{cases} (1,0), & |e_n(\boldsymbol{\theta}_{n-1})| \le \tau\sigma \\ (0,0), & \tau\sigma < |e_n(\boldsymbol{\theta}_{n-1})| < \tau_o\sigma \\ (0,1), & |e_n(\boldsymbol{\theta}_{n-1})| \ge \tau_o\sigma. \end{cases} \qquad (31)$$

Similar to the plain AC rule, a nominal censoring variable $c_n$ is activated
here too for observations with absolute normalized innovation
less than $\tau$. To reveal possible outliers, a second censoring
variable $c_n^o$ is triggered when the absolute normalized innovation
exceeds threshold $\tau_o > \tau$.
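
The three-way decision in (31) amounts to comparing the absolute innovation against two scaled thresholds. The sketch below is one way to write it; the function name is hypothetical, and the treatment of exact ties with a threshold is an assumption of the sketch.

```python
def robust_censoring_flags(innovation, sigma, tau, tau_o):
    """Return (c_n, c_n^o) following the three-way rule (31).

    Ties with a threshold are assigned to the censored and outlier classes
    here; that tie-breaking is an assumption of this sketch.
    """
    if not 0.0 < tau < tau_o:
        raise ValueError("expected 0 < tau < tau_o")
    magnitude = abs(innovation)
    if magnitude <= tau * sigma:
        return (1, 0)   # small innovation: censor the datum
    if magnitude < tau_o * sigma:
        return (0, 0)   # informative datum: use it for the update
    return (0, 1)       # very large innovation: flag as a possible outlier


for e in (0.4, -2.0, 7.5):
    print(e, robust_censoring_flags(e, sigma=1.0, tau=1.0, tau_o=4.0))
```

Only the middle class triggers a regular update, so the rule keeps the data reduction of AC while screening out extreme innovations.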

Having separated data-censoring from outlier identification
in (31), it becomes possible to robustify AC-LMS and AC-RLS
against outliers. Towards this end, one approach is to
completely ignore the $n$-th datum when $c_n^o = 1$. Alternatively, the instantaneous
cost function in (21) can be modified to a truncated
Huber loss (cf. 1^ ) 

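$$f^o(e_n) = \begin{cases} 0, & (c_n, c_n^o) = (1,0) \\ \frac{e_n^2}{2}, & (c_n, c_n^o) = (0,0) \\ \tau_o\sigma\left(|e_n| - \frac{\tau_o\sigma}{2}\right), & (c_n, c_n^o) = (0,1). \end{cases}$$
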
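Applying the first-order SGD iteration on the cost $f^o(e_n)$
yields the robust (r) AC-LMS iteration

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} + \mu\, \mathbf{g}_n(\boldsymbol{\theta}_{n-1}) \qquad (32)$$

where

$$\mathbf{g}_n(\boldsymbol{\theta}) = \begin{cases} \mathbf{0}, & (c_n, c_n^o) = (1,0) \\ \mathbf{x}_n\left(y_n - \mathbf{x}_n^T\boldsymbol{\theta}_{n-1}\right), & (c_n, c_n^o) = (0,0) \\ \tau_o\sigma\,\mathbf{x}_n\,\mathrm{sign}\left(y_n - \mathbf{x}_n^T\boldsymbol{\theta}_{n-1}\right), & (c_n, c_n^o) = (0,1). \end{cases}$$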

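Similarly, the second-order SGD yields the rAC-RLS

$$\boldsymbol{\theta}_n = \boldsymbol{\theta}_{n-1} + \frac{1}{n}\mathbf{C}_n\mathbf{g}_n(\boldsymbol{\theta}_{n-1}) \qquad (33a)$$

$$\mathbf{C}_n = \frac{n}{n-1}\left[\mathbf{C}_{n-1} - \frac{(1-c_n)(1-c_n^o)\,\mathbf{C}_{n-1}\mathbf{x}_n\mathbf{x}_n^T\mathbf{C}_{n-1}}{n - 1 + \mathbf{x}_n^T\mathbf{C}_{n-1}\mathbf{x}_n}\right]. \qquad (33b)$$

Observe that when $c_n^o = 1$, only $\boldsymbol{\theta}_n$ is updated, and the
computationally costly update of (33b) is avoided.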
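
To make the interplay of (31), (33a), and (33b) concrete, the following sketch performs a single rAC-RLS step with plain Python lists. It assumes $n \ge 2$ and a symmetric $\mathbf{C}_{n-1}$, and the helper names are illustrative; initialization and threshold selection are outside its scope.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    """Matrix-vector product for a nested-list matrix."""
    return [dot(row, v) for row in M]


def rac_rls_step(theta, C, x, y, n, sigma, tau, tau_o):
    """One iteration in the form of (33a)-(33b); returns (theta_n, C_n).

    Assumes n >= 2 and a symmetric C, so that C x x^T C = (C x)(C x)^T.
    """
    p = len(theta)
    e = y - dot(x, theta)                  # a-priori error e_n(theta_{n-1})
    censored = abs(e) <= tau * sigma       # c_n = 1
    outlier = abs(e) >= tau_o * sigma      # c_n^o = 1
    scale = n / (n - 1.0)
    if censored or outlier:
        # (1 - c_n)(1 - c_n^o) = 0: only the scaling of C_{n-1} remains in (33b)
        C_new = [[scale * C[i][j] for j in range(p)] for i in range(p)]
    else:
        Cx = mat_vec(C, x)
        denom = n - 1.0 + dot(x, Cx)
        C_new = [[scale * (C[i][j] - Cx[i] * Cx[j] / denom) for j in range(p)]
                 for i in range(p)]
    if censored:
        g = [0.0] * p
    elif outlier:
        sign = 1.0 if e > 0 else -1.0
        g = [tau_o * sigma * sign * xi for xi in x]   # clipped gradient
    else:
        g = [e * xi for xi in x]
    step = mat_vec(C_new, g)
    theta_new = [t + s / n for t, s in zip(theta, step)]
    return theta_new, C_new


theta, C = [0.0, 0.0], [[2.0, 0.0], [0.0, 2.0]]
print(rac_rls_step(theta, C, x=[1.0, 0.0], y=2.0, n=4, sigma=1.0, tau=1.0, tau_o=3.0))
```

For censored and outlier-flagged data the matrix update reduces to a scalar rescaling, which is where the complexity savings come from.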


V. Numerical Tests 

A. SA-MLE 

The online SA-MLE algorithms presented in Section III are
simulated using Gaussian data generated according to (1) with
a time-invariant $\boldsymbol{\theta}_o \in \mathbb{R}^p$, where $p = 30$, $v_n \sim \mathcal{N}(0,1)$, and
$\mathbf{x}_n \sim \mathcal{N}(\mathbf{0}_p, \mathbf{I}_p)$. The first $K = 50$ observations are used to
compute $\boldsymbol{\theta}_K$. The first- and second-order SA-MLE algorithms
are then run for $D = 5,000$ time steps. The NAC rule in (4)
was used with $\tau = 1.5$ to censor approximately 75% of the
observations. Plotted in Fig. 2 is the MSE $E\left[\|\boldsymbol{\theta}_n - \boldsymbol{\theta}_o\|_2^2\right]$
across time $n$, approximated by averaging over 100 Monte
Carlo experiments. Also plotted is the Cramer-Rao lower
bound (CRLB) of the observations, given by modifying the
results of [18] to accommodate the NAC rule in (4). It can be
inferred from the plot that the second-order SA-MLE exhibits
a markedly improved convergence rate compared to its first-order
counterpart, at the price of a minor increase in complexity.
Furthermore, by performing a single pass over the data, the
second-order SA-MLE performs close to the CRLB, thus
offering an attractive alternative to the more computationally
demanding batch Newton-based iterations in [19] and [18].

To further evaluate the efficacy of the proposed methods,
additional simulations were run for different levels of censoring
by adjusting $\tau$. Plotted in Figs. 3(a) and 3(b) are the MSE
curves of the first- and second-order SA-MLE, respectively,
for different values of $\tau$. Notice that censoring up to 50% of
the data (green solid curve) incurs negligible estimation error
compared to the full-data case (blue solid curve). In fact, even
when operating on data reduced by 95% (red dashed curve)
the proposed algorithms yield reliable online estimates.


B. AC-LMS comparison with Randomized Kaczmarz 

The AC-LMS algorithm introduced in Section IV-A was
tested on synthetic data as an alternative to the randomized
Kaczmarz’s algorithm. For this experiment, $D = 30,000$
observations $y_n$ were generated as in (1) with $\sigma^2 = 0.25$,
while the $\mathbf{x}_n$’s of dimension $p = 100$ were generated i.i.d.


Fig. 2. Convergence of first- and second-order SA-MLE ($d/D = 0.25$). Curves: first-order SA-MLE, second-order SA-MLE, and CRLB, plotted against $n$.

Fig. 3. Convergence of (a) first-order SA-MLE; and (b) second-order SA-MLE for different values of $\tau$.


following a multivariate Gaussian distribution. Por the ran¬ 
domized Kaczmarz’s algorithm, the probability of selecting the 
i—th row is p„ = ||x„||i/||X|||, Q. Since the computational 
complexity of the two methods is roughly the same, the
comparison was done in terms of the relative MSE, namely
$E\left[\|\boldsymbol{\theta}_o - \boldsymbol{\theta}_n\|_2^2 / \|\boldsymbol{\theta}_o\|_2^2\right]$. Plotted in Fig. 4 are the relative MSE
curves of the two algorithms w.r.t. the number of data $\{\mathbf{x}_n, y_n\}$
that were used to estimate $\boldsymbol{\theta}_o$ (50 Monte Carlo runs). While
the AC-LMS scans the entire dataset, updating only on informative
data, the randomized Kaczmarz’s algorithm needs access only
to the data used for its updates. This is only possible if the
Fig. 4. Relative MSE for AC-LMS and randomized Kaczmarz’s algorithms.


data-dependent selection probabilities are given a-priori, 
which may not always be the case. Regardless, two more 
experiments were run, in which the AC-LMS had limited 
access to 3,000 and 1,400 data. Overall, it can be argued 
that when the sought reduced dimension is small, the AC- 
LMS offers a simple and reliable first-order alternative to the 
randomized Kaczmarz’s algorithm. 
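
For reference, the benchmark can be sketched as the standard randomized Kaczmarz iteration, in which a row is drawn with probability proportional to its squared Euclidean norm and the estimate is projected onto the hyperplane defined by that row. This is a generic textbook version written with plain Python lists; it is not claimed to reproduce the exact implementation used in the experiment, and the function name is illustrative.

```python
import random


def randomized_kaczmarz(X, y, num_iters, seed=0):
    """Standard randomized Kaczmarz for X theta = y (rows as nested lists).

    Rows are sampled with probability ||x_n||^2 / ||X||_F^2, which requires
    all squared row norms to be known before the iterations start.
    """
    rng = random.Random(seed)
    p = len(X[0])
    row_norms_sq = [sum(v * v for v in row) for row in X]
    theta = [0.0] * p
    indices = range(len(X))
    for _ in range(num_iters):
        n = rng.choices(indices, weights=row_norms_sq, k=1)[0]
        residual = y[n] - sum(a * b for a, b in zip(X[n], theta))
        step = residual / row_norms_sq[n]
        # Project the current estimate onto the hyperplane x_n^T theta = y_n.
        theta = [t + step * a for t, a in zip(theta, X[n])]
    return theta


rng = random.Random(1)
theta_true = [1.0, -2.0, 0.5]
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(40)]
y = [sum(a * b for a, b in zip(row, theta_true)) for row in X]
print(randomized_kaczmarz(X, y, num_iters=2000))
```

Computing `row_norms_sq` up front is the a-priori knowledge of selection probabilities referred to above, whereas AC-LMS decides on each datum as it is scanned.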

C. AC-RLS 

The AC-RLS algorithm developed in Section IV-B was
tested on synthetic data. Specifically, the AC-RLS is treated
here as an iterative method that sweeps once through the
entire dataset, even though more sweeps can be performed
at the cost of additional runtime. Its performance in terms
of relative MSE was compared with the Hadamard (HD)
preconditioned randomized LS solver, and plotted as a
function of the compression ratio $d/D$. Alongside the two
methods, a uniform sampling randomized LSE was run as a
simple benchmark. Measurements were generated according to
(1) with $p = 300$, $D = 10,000$, and $v_n \sim \mathcal{N}(0, 9)$. Regarding
the data distribution, three different scenarios were examined.
In Fig. 5(a), the $\mathbf{x}_n$’s were generated according to a heavy-tailed
multivariate $t$-distribution with one degree of freedom, and
covariance matrix with (i,j)-th entry ^ = 2 x 
Such a data distribution yields matrices X with highly non- 
uniform leverage scores, thus imitating the effect of a subset 
of highly “important” observations randomly scattered in the 
dataset. In such cases, uniform sampling without precon¬ 
ditioning performs poorly since many of those informative 
measurements are missed. As seen in the plot, precondi¬ 
tioning significantly improves performance, by incorporating 
“important” information through random projections. Further 
improvement is effected by our data-driven AC-RLS through 
adaptively selecting the most informative measurements and 
ignoring the rest, without overhead in complexity. 

The experiment was repeated (Fig. 5(b)) for $\mathbf{x}_n$ generated
from a multivariate $t$-distribution with 3 degrees of freedom,
and $\boldsymbol{\Sigma}$ as before. Leverage scores for this dataset are moderately
non-uniform, thus inducing more redundancy and resulting
in lower performance for all algorithms, while closing the
“gap” between preconditioned and non-preconditioned random
sampling. Again, the proposed AC-RLS performs significantly



Fig. 5. Relative MSE of AC-RLS and randomized LS algorithms, for different
levels of data reduction. Regression matrix $\mathbf{X}$ was generated with highly non-uniform
(a), moderately non-uniform (b), and uniform leverage scores (c).


better in estimating the unknown parameters for the entire 
range of data size reduction. 

Finally, Fig. 5(c) depicts related performance for Gaussian
$\mathbf{x}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_p)$. Compared to the previous cases, normally
distributed rows yield a highly redundant set of measurements
with $\mathbf{X}$ having almost uniform leverage scores. As seen in
the plots, preconditioning offers no improvement in random
sampling for this type of data, whereas the AC-RLS succeeds in
extracting more information on the unknown $\boldsymbol{\theta}_o$.

To further assess efficacy of the AC-RLS algorithm, real 
data tests were performed. The Protein Tertiary Structure 
Fig. 6. Relative MSE of AC-RLS and randomized LS algorithms, for different
levels of data reduction using the protein tertiary structure dataset.

Fig. 7. Relative MSE of AC-RLS, rAC-RLS, and randomized LS algorithms,
for different levels of data reduction using an outlier-corrupted dataset.


dataset from the UCI Machine Learning Repository was tested.
In this linear regression dataset, $p = 9$ attributes of proteins
are used to predict a value related to protein structure. A total
of $D = 45,730$ observations are included. Since the true
$\boldsymbol{\theta}_o$ is unknown, it is estimated by solving LS on the entire
dataset. Subsequently, the noise variance is also estimated
via sample averaging as $\sigma^2 = (1/D)\sum_{n=1}^{D}\left(y_n - \mathbf{x}_n^T\boldsymbol{\theta}_o\right)^2$.

Figure 6 depicts relative squared-error (RSE) with respect to
the data reduction ratio $d/D$. The RSE curve for the HD-preconditioned
LS corresponds to the average RSE across
50 runs, while the size of the vertical bars is proportional
to its standard deviation. Different from RP-based methods,
the RSE for AC-RLS does not entail standard deviation bars,
because for a given initialization and data order, the output
of the algorithm is deterministic. It can be observed that
for $d/D > 0.25$ the AC-RLS outperforms RPs in terms
of estimating $\boldsymbol{\theta}_o$, while for very small $d/D$, RPs yield a
lower average RSE, at the cost, however, of very high error
uncertainty (variance).


D. Robust AC-RLS 

To test rAC-LMS and rAC-RLS of Section IV-D, datasets
were generated with $D = 10,000$, $p = 30$ and $\mathbf{x}_n \sim$
A/’(0, S), where = 2 x noise was i.i.d. Gaussian 

$v_n \sim \mathcal{N}(0, 9)$; meanwhile measurements $y_n$ were generated
according to (1) with random and sporadic outlier spikes
$\{o_n\}_{n=1}^{D}$. Specifically, we generated $o_n = \alpha_n\beta_n$, where
$\alpha_n \sim \mathrm{Bernoulli}(0.05)$, and $\beta_n \sim \mathcal{N}(0, 25 \times 9)$, thus resulting
in approximately 5% of the data effectively being outliers.
Similar to previous experiments, our novel algorithms were
run once through the set selecting $d$ out of $D$ data to update
$\boldsymbol{\theta}_n$. Plotted in Fig. 7 is the RSE averaged across 100 runs as a
function of $d/D$ for the HD-preconditioned LS, the plain AC-RLS,
and the rAC-RLS with a Huber-like instantaneous cost.
As expected, the performance of AC-RLS is severely undermined,
especially when tuned for very small $d/D$, exhibiting
higher error than the RP-based LS. However, our rAC-RLS
algorithm offers superior performance across the entire range
of $d/D$ values.
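
The outlier model described above ($o_n = \alpha_n\beta_n$ added to the nominal measurements) is easy to simulate. In the sketch below the regressors are drawn with identity covariance and the outlier enters additively on top of the linear model with Gaussian noise; both are simplifying assumptions of this sketch rather than the exact setup of the experiment, and the function name is hypothetical.

```python
import random


def generate_outlier_data(theta_o, D, noise_var=9.0, outlier_prob=0.05,
                          outlier_var=25.0 * 9.0, seed=0):
    """Draw D triplets (x_n, y_n, alpha_n) with y_n = x_n^T theta_o + v_n + o_n.

    o_n = alpha_n * beta_n with alpha_n ~ Bernoulli(outlier_prob) and
    beta_n ~ N(0, outlier_var). Regressors use identity covariance here,
    which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(D):
        x = [rng.gauss(0.0, 1.0) for _ in theta_o]
        v = rng.gauss(0.0, noise_var ** 0.5)
        alpha = 1 if rng.random() < outlier_prob else 0
        beta = rng.gauss(0.0, outlier_var ** 0.5)
        y = sum(a * b for a, b in zip(x, theta_o)) + v + alpha * beta
        data.append((x, y, alpha))
    return data


data = generate_outlier_data([1.0] * 30, D=10000)
print(sum(alpha for _, _, alpha in data) / len(data))  # empirical outlier fraction
```

Keeping the indicator $\alpha_n$ alongside each datum makes it possible to check afterwards how many true outliers a rule such as (31) flags.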


VI. Concluding Remarks 

We developed online algorithms for large-scale LS linear
regressions that rely on censoring for data-driven dimensionality
reduction of streaming Big Data. First, a non-adaptive
censoring setting was considered for applications where observations
are censored - possibly naturally - separately and
prior to estimation. Computationally efficient first- and second-order
online algorithms were derived to estimate the unknown
parameters, relying on stochastic approximation of the log-likelihood
of the censored data. Performance was bounded
analytically, while simulations demonstrated that the second-order
method performs close to the CRLB.

Furthermore, online data reduction occurring in parallel with
estimation was also explored. For this scenario, censoring
is performed deliberately and adaptively based on estimates
provided by first- and second-order algorithms. Robust versions
were also developed for estimation in the presence of
outliers. Studied under the scope of stochastic approximation,
the proposed algorithms were shown to enjoy guaranteed
MSE performance. Moreover, the resulting recursive methods
were advocated as low-complexity recursive solvers of large
LS problems. Experiments run on synthetic and real datasets
corroborated that the novel AC-LMS and AC-RLS algorithms
outperformed competing randomized algorithms.

Our future research agenda includes approaches to nonlinear 
(e.g., kernel-based) parametric and nonparametric large-scale 
regressions, along with estimation of dynamical (e.g., state- 
space) processes using adaptively censored measurements. 


Appendix 

Proof of Proposition\I\ It can be verified that ^ 

0, which implies the convexity of £„(0) ifTSl . The regret of 
the SGD approach is then bounded as ifTOl Corollary 2.7] 

D 

""" -Ij 


1 


R{.D)<j-\\e*-e,\\l + pY.\\^^n{en-i 

^ n=l 




D 

n—1 
where $\{\boldsymbol{\theta}_n\}_{n=1}^{D}$ is any sequence of estimates produced by
the SA-MLE. By choosing fi — ||0* — 0k\\ 2 /W‘^■Dj3x), the 
aforementioned bound leads to Proposition [T] ■ 

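Proof of Proposition 2: For the SGD update in (22), the
MSE $E_{\mathbf{x},v}\left[\|\boldsymbol{\theta}_n - \boldsymbol{\theta}_o\|_2^2\right]$ with $\boldsymbol{\theta}_o = \arg\min_{\boldsymbol{\theta}} F(\boldsymbol{\theta})$, where
$F(\boldsymbol{\theta}) := E_{\mathbf{x},v}\left[f^{(\tau)}(\boldsymbol{\theta}; \mathbf{x}, y)\right]$, is bounded as in [30]. For this to
hold, we must have: a1) the gradient bounded at the optimum;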
that is, Ex,„ [II y)|||] < A; a2) the gradient must 
be L—smooth for any other 0; and a3) F{9) must be a- 
strongly convex ll^ . With x and v generated randomly and 
independently across time, associated quantities do not depend 
on n. Furthermore, the points of discontinuity of ) are 
zero-measure in expectation, and thus are neglected for brevity. 

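Under a3), there exists a constant $\alpha > 0$ such that
$\nabla^2 F(\boldsymbol{\theta}) \succeq \alpha\mathbf{I}$ $\forall\boldsymbol{\theta}$. Interchanging differentiation with expectation
yields

$$\begin{aligned}
\nabla^2 F(\boldsymbol{\theta}) &= \nabla^2 E_{\mathbf{x},v}\left[f^{(\tau)}(\boldsymbol{\theta}; \mathbf{x}, y)\right] = E_{\mathbf{x},v}\left[\nabla^2 f^{(\tau)}(\boldsymbol{\theta}; \mathbf{x}, y)\right] = E_{\mathbf{x},v}\left[\mathbf{x}\mathbf{x}^T(1 - c)\right] \\
&= \int_{\mathbf{x}}\int_{v} \mathbf{x}\mathbf{x}^T\, 1_{\{|\mathbf{x}^T(\boldsymbol{\theta}_o - \boldsymbol{\theta}) + v| > \tau\sigma\}}\, p_v(v)\, p_x(\mathbf{x})\, dv\, d\mathbf{x} \\
&= \int_{\mathbf{x}} \mathbf{x}\mathbf{x}^T \left[1 - Q\left(-\tau - \frac{\mathbf{x}^T(\boldsymbol{\theta}_o - \boldsymbol{\theta})}{\sigma}\right) + Q\left(\tau - \frac{\mathbf{x}^T(\boldsymbol{\theta}_o - \boldsymbol{\theta})}{\sigma}\right)\right] p_x(\mathbf{x})\, d\mathbf{x} \\
&= \int_{\mathbf{x}} \mathbf{x}\mathbf{x}^T \left[Q\left(\tau + \frac{\mathbf{x}^T(\boldsymbol{\theta}_o - \boldsymbol{\theta})}{\sigma}\right) + Q\left(\tau - \frac{\mathbf{x}^T(\boldsymbol{\theta}_o - \boldsymbol{\theta})}{\sigma}\right)\right] p_x(\mathbf{x})\, d\mathbf{x}.
\end{aligned}$$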

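It can be verified that the function $g(z) := Q(\tau + z) + Q(\tau - z)$
is minimized for $z = 0$ when $\tau > 0$. To see this, observe that
its derivative $g'(z) = -\phi(\tau + z) + \phi(\tau - z)$ vanishes when
$|\tau + z| = |\tau - z|$. Therefore, $g(z) \ge g(0) = 2Q(\tau)$ for all $z$;
and hence,

$$Q\left(\tau + \frac{\mathbf{x}^T(\boldsymbol{\theta}_o - \boldsymbol{\theta})}{\sigma}\right) + Q\left(\tau - \frac{\mathbf{x}^T(\boldsymbol{\theta}_o - \boldsymbol{\theta})}{\sigma}\right) \ge 2Q(\tau)$$

for all $\mathbf{x}$ and $\boldsymbol{\theta}$. The latter implies

$$\nabla^2 F(\boldsymbol{\theta}) \succeq \int_{\mathbf{x}} \mathbf{x}\mathbf{x}^T\, 2Q(\tau)\, p_x(\mathbf{x})\, d\mathbf{x} = 2Q(\tau)\mathbf{R}_x \succeq 2Q(\tau)\lambda_{\min}(\mathbf{R}_x)\mathbf{I}$$

showing that $F(\boldsymbol{\theta})$ is $\alpha$-strongly convex with $\alpha = 2Q(\tau)\lambda_{\min}(\mathbf{R}_x)$.
As expected, $\alpha$ reduces for increasing $\tau$.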
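
The key inequality $g(z) \ge 2Q(\tau)$ used in the strong-convexity argument can be confirmed numerically on a grid. The short sketch below relies only on the Python standard library; the grid range and the value of `tau` are arbitrary illustrative choices.

```python
from statistics import NormalDist


def q_function(x):
    """Gaussian tail probability Q(x) = Pr{Z > x} for Z ~ N(0, 1)."""
    return 1.0 - NormalDist().cdf(x)


def g(z, tau):
    """g(z) = Q(tau + z) + Q(tau - z), as in the strong-convexity argument."""
    return q_function(tau + z) + q_function(tau - z)


tau = 1.5
grid = [k / 100.0 for k in range(-500, 501)]   # z in [-5, 5] with step 0.01
z_min = min(grid, key=lambda z: g(z, tau))
# The grid minimizer and its value, next to the claimed minimum 2 Q(tau).
print(z_min, g(z_min, tau), 2.0 * q_function(tau))
```

Since $2Q(\tau)$ shrinks as $\tau$ grows, the same computation also shows numerically how the strong-convexity constant $\alpha$ degrades with heavier censoring.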

Regarding the instantaneous gradient, it suffices to 
find L such that Ex,« [||V/(^)(6>i) - V/(^)(02)||i] < 

L‘^\\ei-e 2 \\l for all n and any pair (0i, 02)- For the errors 


= E: 
= E. 


Ci '■= 9o — 9i for i = 1, 2, it holds 

Ex,„ 

[||xe( 0 i)(l - Cl) - xe(02)(l - C2)||i] 

Cl “ 1 “'^)l{|x^^i+i;|>Tcr} 


X.i; 


'x,v 

T 


- x(x^ C 2 +1')l{|xrC2-H'«l> 


r^T}ll2 


|xx ]L||x'rj^^_|_.j;|>.7-^| XX C2 ^"^^l 


= Ex 

^ \ ^ \ ^ 

^X,U Cl j C2 

— 2 C,i (xx ) C2l{|x'rCi-l-«l>'ro’} 

-f X XX Cl {l{|x'^(^j^-t-lj|>Tcr} ^{|x^C 2 +'^l— 

X XX C2 |x^^2-t-^l ^'^^ 1 ^ ^{I^^C2+^I— 

T l|x||2t^ (^{|x^Cl-l-«|>Tcr} ■ (34) 

It can be verified that since the cross-terms in (l34l) can be 
bounded from below and above as 


Ex [x^xx^] Cl-^(Cl,C2) < Ex[x^XX^Cl 

X E^ [l{|xr ^i+'u|>ri7}^ (i{| 

<Ex [x^xx^]CiC/(Ci,C2) 


x^Cl-r^l^Tcr} l{|x^C2+’'l 




they are also equal to zero if the third-order moment 
Ex [x^xx^l = 0. Furthermore, by simply bounding 


Et, [l{|x^Ci+t'l>To-}] < 1 as probabilities, (l34l i yields 


(Ci-C2)^ (xx^)^(Ci-C2) 


E[||V/( 0 i)-V/( 02 )||^] 

+ llxllaEt, (l{|xrCi-|-«|>ra-} - l{|x^C2-l-’'l>-r'^})^ 

= (Cl ~ C2)^®X (xx^) (Cl — C2) 


■Ex 


|x||oE„ 


(1{ 


|x^<i-|-«|>rcr} ^{|x'^C 2 -l-’'l 


>rcr}) 


< (A„,ax(E[(xX^)']) +A.) ||01-02||i 


The last expression reveals that the average distance between
gradients can be decomposed into two terms. The first term
can be bounded using the fourth-order moment. The second
term appears due to data censoring and clearly depends on $\tau$,
while it is assumed bounded as


Ex 


Ixll^E. 


2^v 
1 


(1{| 






< \ r \\ 9 ,- 92 \\l 


Although we could not express $\lambda_\tau$ in closed form, for relatively
small values of $\tau$ used in practice to censor more
than 90% of the measurements, $\lambda_\tau \approx 0$; thus, the second
term can be ignored yielding $L^2 \approx \lambda_{\max}\left(E\left[(\mathbf{x}\mathbf{x}^T)^2\right]\right)$.
Furthermore, even for large $\tau$ some inaccuracy in the value
of $L$ can be tolerated, after considering that it does not affect
the algorithm’s stability or asymptotic performance when a
vanishing step size is used.
Finally, the expected norm of the gradient at $\boldsymbol{\theta} = \boldsymbol{\theta}_o$ is
bounded and equal to


E 


l|V/(")(eo)|| 


= Ex [||x| 
= tr(R^) 


2] Ew 




,^2 2 

G — G 


E [||x||ie(6»o)(l - c 

)] 

l{|?;|>Tcr}] 



r ^2 

2 e“^ 

V , dv 

VSttct^ 



Q (-) - 

L \g/ g 

B] 

T<7 

— T<7 


= tr(R2:) 

= 2cr^tr(Ra,) (1 - Q[t) + r(/)(T)) 

which completes the proof. ■ 

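Proof of Proposition 3: For the error vector $\boldsymbol{\zeta}_n := \boldsymbol{\theta}_n - \boldsymbol{\theta}_o$,
AC-RLS satisfies $\boldsymbol{\zeta}_n = \mathbf{C}_n\sum_{i=1}^{n}\mathbf{x}_i v_i(1 - c_i)$. If $\{c_i\}_{i=1}^{n}$ are
deterministic and given, the error covariance matrix $\mathbf{K}_n :=
E[\boldsymbol{\zeta}_n\boldsymbol{\zeta}_n^T]$ becomes

$$\begin{aligned}
\mathbf{K}_n &= E_{\mathbf{x},v}\left[\mathbf{C}_n\sum_{i=1}^{n}\sum_{j=1}^{n}\mathbf{x}_i\mathbf{x}_j^T v_i v_j (1 - c_i)(1 - c_j)\mathbf{C}_n\right] \\
&= E_{\mathbf{x}}\left[\mathbf{C}_n\sum_{i=1}^{n}\sum_{j=1}^{n}\mathbf{x}_i\mathbf{x}_j^T E_v[v_i v_j] (1 - c_i)(1 - c_j)\mathbf{C}_n\right] \\
&= \sigma^2 E_{\mathbf{x}}\left[\mathbf{C}_n\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^T (1 - c_i)\mathbf{C}_n\right] \\
&= \sigma^2 E_{\mathbf{x}}\left[\mathbf{C}_n\mathbf{C}_n^{-1}\mathbf{C}_n\right] = \sigma^2 E_{\mathbf{x}}[\mathbf{C}_n].
\end{aligned}$$
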

Assuming x„xj(l — c„) to be ergodic and for large enough 

n 

n, the matrix C~^ = ^Xixf(l — Ci) can be approx- 

2 — 1 

imated by nEx,,, [xx^(l — c)] = nEx [xx^E.„[l — c]] = 
nEx [xx^ Pr{c = 0|x}] = Given that 2Q{t) < 

Pr{c = 0|x} < 1 Vx, we obtain 

2Q(r)nR^ ^ A nR^;. 

Since C„ converges monotonically to Coo, there exists A: > 0 
such that for all n > fc 


1 


_J_'D —1 ^ ^ 

n " - " - 2Q{T)n 


R; 


The result follows given that $E\left[\|\boldsymbol{\theta}_n - \boldsymbol{\theta}_o\|_2^2\right] = \mathrm{tr}(\mathbf{K}_n) =
\sigma^2\,\mathrm{tr}(E[\mathbf{C}_n])$. ■